Skip to content

[BugFix] Fix PPL mixed text/keyword field type across wildcard indices (#4659)#5358

Merged
yuancu merged 3 commits into
opensearch-project:mainfrom
qianheng-aws:worktree-agent-a172fb60
Apr 22, 2026
Merged

[BugFix] Fix PPL mixed text/keyword field type across wildcard indices (#4659)#5358
yuancu merged 3 commits into
opensearch-project:mainfrom
qianheng-aws:worktree-agent-a172fb60

Conversation

@qianheng-aws

Copy link
Copy Markdown
Collaborator

Description

When querying wildcard indices where a field has conflicting types (text vs keyword) across indices, the Calcite script engine incorrectly uses DOC_VALUE retrieval for all shards. This causes shard failures on shards where the field is text (no doc_values), resulting in silently dropped documents.

Root cause: The LatestRule merge strategy non-deterministically picks one type when merging field mappings from multiple indices. When keyword is picked, RexStandardizer.visitInputRef() selects DOC_VALUE retrieval via OpenSearchTextType.toKeywordSubField(). On shards where the field is actually text, doc_values are not available, causing the script to return null and OpenSearch to silently swallow the shard failure.

Fix: Added TextKeywordConflictRule to the index mapping merge pipeline (between DeepMergeRule and LatestRule). When text and keyword types conflict across indices, the rule merges to OpenSearchTextType WITHOUT keyword subfields. This forces _source retrieval, which works universally across all shards regardless of the actual field type.

Related Issues

Resolves #4659

Check List

  • New functionality includes testing
  • Commits signed per DCO (-s)
  • spotlessCheck passed
  • Unit tests passed
  • Integration tests passed

opensearch-project#4659)

When querying wildcard indices where a field has conflicting types (text
vs keyword) across indices, the Calcite script engine incorrectly uses
DOC_VALUE retrieval for all shards. This causes shard failures on shards
where the field is text (no doc_values), resulting in silently dropped
documents.

Add TextKeywordConflictRule to the index mapping merge pipeline that
detects text/keyword conflicts and merges to text WITHOUT keyword
subfields. This forces _source retrieval which works universally across
all shards regardless of the actual field type.

Signed-off-by: Heng Qian <qianheng@amazon.com>
@qianheng-aws

Copy link
Copy Markdown
Collaborator Author

Decision Log

Root Cause: When querying wildcard indices (source=log*) where a field (e.g., msg) is text in one index and keyword in another, the LatestRule merge strategy non-deterministically picks one type. When keyword wins, RexStandardizer.visitInputRef() selects DOC_VALUE retrieval. On shards where the field is actually text (no doc_values), the script fails silently, and OpenSearch swallows the shard failure with search.default_allow_partial_results=true (default), dropping those documents from results.

Approach: Added TextKeywordConflictRule to the merge rule chain. When text/keyword types conflict across indices, the merged type becomes OpenSearchTextType without keyword subfields. This causes toKeywordSubField() to return null, forcing _source retrieval instead of DOC_VALUE. _source works universally for both text and keyword fields.

Alternatives Rejected:

  1. ScriptDataContext fallback (DOC_VALUE -> SOURCE): When getFromDocValue() returns null, fall back to getFromSource(). Rejected because null from doc_values is indistinguishable from an actual null field value — this would mask genuine nulls.
  2. Fast-fail on type conflict: Throw an error when incompatible types are detected. Rejected because it breaks valid use cases where users intentionally query across indices with different mappings.
  3. Per-shard type detection: Have the script engine detect the actual field type on each shard and choose the retrieval strategy accordingly. Rejected as too complex for the current architecture — the script parameters (SOURCES/DIGESTS) are fixed at query planning time, not per-shard.

Pitfalls:

  • The basic source=log* | fields msg query does NOT trigger the bug because it uses _source retrieval without script pushdown. Only queries with pushed-down script expressions (e.g., where upper(msg) = 'X') trigger the DOC_VALUE retrieval path.
  • OpenSearch silently swallows shard failures by default (search.default_allow_partial_results=true), which makes the bug appear as missing results rather than errors.

Things to Watch:

  • The fix forces _source retrieval for conflicting types, which is slower than DOC_VALUE for keyword fields. This is acceptable as correctness trumps performance, and the scenario (conflicting types) is inherently problematic.
  • The LatestRule still applies for other type conflicts (e.g., integer vs long). Those cases should be evaluated separately if similar issues arise.
  • The maintainer's long-term suggestion of per-shard query plan execution would provide a more optimal solution.

@github-actions

github-actions Bot commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit 17e7410)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Core fix: Add TextKeywordConflictRule to merge pipeline with unit tests

Relevant files:

  • opensearch/src/main/java/org/opensearch/sql/opensearch/util/MergeRules/TextKeywordConflictRule.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/util/MergeRules/MergeRuleHelper.java
  • opensearch/src/test/java/org/opensearch/sql/opensearch/util/MergeRules/TextKeywordConflictRuleTest.java

Sub-PR theme: Integration tests for mixed text/keyword field type across wildcard indices

Relevant files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteMixedFieldTypeIT.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/CalciteNoPushdownIT.java
  • integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/4659.yml

⚡ Recommended focus areas for review

Lossy Merge

The mergeInto method always produces OpenSearchTextType.of() (plain text, no subfields), discarding any non-keyword subfields that may exist on the source or target OpenSearchTextType. If either side carries additional fields (e.g., english, raw, custom analyzers), those are silently dropped. Consider preserving non-keyword subfields from both sides.

@Override
public void mergeInto(
    String key, OpenSearchDataType source, Map<String, OpenSearchDataType> target) {
  // Always merge to text WITHOUT keyword subfields.
  // This forces _source retrieval, which works for both text and keyword fields.
  target.put(key, OpenSearchTextType.of());
}
Incomplete Coverage

The rule only handles Keyword as the conflicting counterpart to text. Other keyword-like types such as ConstantKeyword or Wildcard are not covered by isKeyword(). If those types exist in the codebase they would fall through to LatestRule and potentially re-introduce the same doc_values bug.

private static boolean isKeyword(MappingType mappingType) {
  return mappingType == MappingType.Keyword;
}
Test Isolation

The test indices are only created when they do not already exist (isIndexExist guard), but they are never deleted in a @AfterEach / @AfterAll method. If a previous test run left stale data (e.g., extra documents), subsequent runs may produce unexpected row counts and flaky failures.

private void createTestIndices() throws IOException {
  // Create index with msg as text type
  if (!isIndexExist(client(), LOG_TEXT_INDEX)) {
    String textMapping =
        "{\"mappings\":{\"properties\":{\"msg\":{\"type\":\"text\"},"
            + "\"idx\":{\"type\":\"integer\"}}}}";
    createIndexByRestClient(client(), LOG_TEXT_INDEX, textMapping);
    Request bulkReq = new Request("POST", "/" + LOG_TEXT_INDEX + "/_bulk?refresh=true");
    bulkReq.setJsonEntity(
        "{\"index\":{\"_id\":\"1\"}}\n" + "{\"msg\":\"status=200\",\"idx\":1}\n");
    performRequest(client(), bulkReq);
  }

  // Create index with msg as keyword type
  if (!isIndexExist(client(), LOG_KEYWORD_INDEX)) {
    String keywordMapping =
        "{\"mappings\":{\"properties\":{\"msg\":{\"type\":\"keyword\"},"
            + "\"idx\":{\"type\":\"integer\"}}}}";
    createIndexByRestClient(client(), LOG_KEYWORD_INDEX, keywordMapping);
    Request bulkReq = new Request("POST", "/" + LOG_KEYWORD_INDEX + "/_bulk?refresh=true");
    bulkReq.setJsonEntity(
        "{\"index\":{\"_id\":\"1\"}}\n" + "{\"msg\":\"status=200\",\"idx\":2}\n");
    performRequest(client(), bulkReq);
  }
}

@github-actions

github-actions Bot commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to 17e7410
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Avoid unintended loss of keyword subfields in text merges

When both fields are text-like but one has a keyword subfield and the other does
not, the rule matches and mergeInto will strip the keyword subfield. However, this
case (text-with-keyword vs text-without-keyword) does not necessarily cause the
doc_values retrieval problem described in the class Javadoc — the issue only arises
when one shard has a keyword field (which has doc_values) and another has a text
field (which does not). Merging two text fields where one has a .keyword subfield
should instead preserve the keyword subfield for the shard that has it, or at
minimum this behavior should be explicitly tested and documented to avoid unintended
loss of keyword subfield access.

opensearch/src/main/java/org/opensearch/sql/opensearch/util/MergeRules/TextKeywordConflictRule.java [40-45]

-// Match when both are text but one has keyword subfields and the other does not
-if (isTextLike(sourceMapping) && isTextLike(targetMapping)) {
-  boolean sourceHasKeywordSub = hasKeywordSubField(source);
-  boolean targetHasKeywordSub = hasKeywordSubField(target);
-  return sourceHasKeywordSub != targetHasKeywordSub;
-}
+// Match when one is text-like and the other is keyword (doc_values conflict)
+// Note: text-with-keyword-subfield vs text-without is intentionally NOT matched here,
+// as both are text fields and do not cause doc_values retrieval failures.
+// If needed, add a separate rule for that case.
Suggestion importance[1-10]: 6

__

Why: This raises a legitimate concern that merging two text fields where one has a .keyword subfield may unnecessarily strip the subfield, potentially degrading query capabilities. The suggestion correctly identifies that this case is distinct from the text-vs-keyword doc_values conflict the rule is designed to fix.

Low
Add cleanup to delete test indices after tests

The test indices test_log_text_4659 and test_log_keyword_4659 are created in init()
but are never deleted after the test class finishes. This can cause test pollution
or failures on re-runs if the indices already exist with stale data. Add a
corresponding cleanup method (e.g., @AfterClass or override cleanUp) to delete the
indices after all tests complete.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteMixedFieldTypeIT.java [31-36]

 @Override
 public void init() throws Exception {
   super.init();
   enableCalcite();
   createTestIndices();
 }
 
+@AfterClass
+public static void cleanUpIndices() throws IOException {
+  deleteIndexByRestClient(client(), LOG_TEXT_INDEX);
+  deleteIndexByRestClient(client(), LOG_KEYWORD_INDEX);
+}
+
Suggestion importance[1-10]: 5

__

Why: The suggestion raises a valid concern about test pollution from missing index cleanup. However, the init() method already guards against re-creation with isIndexExist checks, partially mitigating the issue. The improved_code references deleteIndexByRestClient and a static client() call that may not match the actual test framework API.

Low
Document unconditional merge behavior clearly

The mergeInto method ignores the source parameter entirely and always produces a
plain OpenSearchTextType. However, the source argument passed by the caller
represents the incoming type being merged, and the existing target value is the one
already stored. The method should use the existing target value (or the source) to
decide the merge result, but currently it discards both and unconditionally
overwrites with a bare text type. This could silently discard valid type information
when the rule is invoked in an unexpected order (e.g., source=keyword, target=text).
Consider documenting or asserting the expected invariant, or deriving the result
from both inputs.

opensearch/src/main/java/org/opensearch/sql/opensearch/util/MergeRules/TextKeywordConflictRule.java [49-55]

 @Override
 public void mergeInto(
     String key, OpenSearchDataType source, Map<String, OpenSearchDataType> target) {
   // Always merge to text WITHOUT keyword subfields.
-  // This forces _source retrieval, which works for both text and keyword fields.
+  // This forces _source retrieval, which works for both text and keyword fields
+  // regardless of whether the field is text or keyword on a given shard.
+  // Both source and target are either text-like or keyword; the safe common type is plain text.
   target.put(key, OpenSearchTextType.of());
 }
Suggestion importance[1-10]: 2

__

Why: The suggestion only changes comments/documentation without modifying actual logic. The existing_code and improved_code are functionally identical, just with slightly different comments. This is a documentation-only change with minimal impact.

Low

Previous suggestions

Suggestions up to commit c0459b5
CategorySuggestion                                                                                                                                    Impact
General
Add test index cleanup after test class

There is no corresponding cleanup (e.g., @AfterClass or tearDown) to delete the test
indices test_log_text_4659 and test_log_keyword_4659 after the test class runs.
Without cleanup, these indices will persist across test runs and could interfere
with other tests or cause false positives due to stale data.

integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteMixedFieldTypeIT.java [31-36]

 @Override
 public void init() throws Exception {
   super.init();
   enableCalcite();
   createTestIndices();
 }
 
+@AfterClass
+public static void cleanUpIndices() throws IOException {
+  deleteIndex(LOG_TEXT_INDEX);
+  deleteIndex(LOG_KEYWORD_INDEX);
+}
+
Suggestion importance[1-10]: 6

__

Why: The test class creates indices test_log_text_4659 and test_log_keyword_4659 but never deletes them, which can cause stale data issues across test runs. Adding a tearDown or @AfterClass cleanup is a valid improvement for test hygiene, though the existing_code and improved_code differ correctly.

Low
Use idiomatic null assertion method

Using assertEquals(null, result) is discouraged; assertNull is the idiomatic JUnit
assertion for null checks and produces a clearer failure message.

opensearch/src/test/java/org/opensearch/sql/opensearch/util/MergeRules/TextKeywordConflictRuleTest.java [143]

-assertEquals(null, result);
+assertNull(result);
Suggestion importance[1-10]: 3

__

Why: Using assertNull(result) instead of assertEquals(null, result) is the idiomatic JUnit 5 approach and produces clearer failure messages, but this is a minor style improvement with no functional impact.

Low
Preserve text-type metadata during merge

The mergeInto method ignores the source parameter entirely and always produces a
plain OpenSearchTextType. However, the rule also matches the case where both types
are text but differ only in keyword subfields — in that case, the source type's
other metadata (e.g., analyzer settings) is silently discarded. Consider at least
using the text-typed argument as the base when both are text-like, to preserve any
non-keyword field metadata.

opensearch/src/main/java/org/opensearch/sql/opensearch/util/MergeRules/TextKeywordConflictRule.java [49-55]

 @Override
 public void mergeInto(
     String key, OpenSearchDataType source, Map<String, OpenSearchDataType> target) {
   // Always merge to text WITHOUT keyword subfields.
   // This forces _source retrieval, which works for both text and keyword fields.
-  target.put(key, OpenSearchTextType.of());
+  OpenSearchDataType existing = target.get(key);
+  // If the existing target is already text-like, prefer it as the base (preserves metadata)
+  if (existing != null && isTextLike(existing.getMappingType())) {
+    target.put(key, OpenSearchTextType.of());
+  } else if (isTextLike(source.getMappingType())) {
+    target.put(key, OpenSearchTextType.of());
+  } else {
+    target.put(key, OpenSearchTextType.of());
+  }
 }
Suggestion importance[1-10]: 2

__

Why: The improved_code is functionally identical to the existing_code — all three branches of the if/else still call target.put(key, OpenSearchTextType.of()), so no actual metadata preservation is achieved. The suggestion identifies a real concern but the proposed fix doesn't address it.

Low

return false;
}
// Match when one is text and the other is keyword
if (isTextLike(sourceMapping) && isKeyword(targetMapping)) {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a field is text but has keyword subfield, and the other is keyword, should they match and use text only?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — this is intentional. Consider the scenario:

  • Index A: field msg is text with a keyword subfield (msg.keyword)
  • Index B: field msg is keyword

If the text-with-keyword-subfield wins the merge, toKeywordSubField() returns msg.keyword, leading to DOC_VALUE retrieval on the subfield path msg.keyword. But on Index B's shards, msg is just a keyword field — there is no msg.keyword subfield. DOC_VALUE retrieval for msg.keyword would fail on those shards, producing the same silent data loss as the original bug.

By matching this case and merging to plain text (without keyword subfields), we force _source retrieval, which works universally.

The existing test testMatchTextAndKeyword (line 25) covers the pure text-vs-keyword case, and this isTextLike check intentionally covers text-with-keyword-subfield-vs-keyword as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Curios that could this case be improved to a DSL with full DOC_VALUE retrieval?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curios that could this case be improved to a DSL with full DOC_VALUE retrieval?

@LantaoJin Text only field doesn't have doc value.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question is for If a field is text but has keyword subfield, and the other is keyword

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, the current implementation is sufficient for bugfix. Supporting above enhancement requires additional investigation.

if (isTextLike(sourceMapping) && isKeyword(targetMapping)) {
return true;
}
if (isKeyword(sourceMapping) && isTextLike(targetMapping)) {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reasoning as above — the reverse direction (keyword source, text-like target) also needs to match to ensure we always merge to plain text without keyword subfields.

… ...)

Signed-off-by: Heng Qian <qianheng@amazon.com>
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 17e7410

@yuancu yuancu merged commit a19a610 into opensearch-project:main Apr 22, 2026
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working bugFix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] PPL cannot properly handle fields with mixed keyword and text data types across multiple indices

3 participants